Introduction

library(tidyverse)
library(ggplot2)
library(ggpubr)
library(kableExtra)
library(weathermetrics)

Let’s load the mpg dataset from ggplot2. mpg contains a subset of the fuel economy data. The data frame has 234 rows and 11 variables:

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

The glimpse() function can be used to get an overview of the dataframe structure. It is similar to str(), but more readable.

glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
# for comparison
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Question 1

Are there any variables that do not have the class you expect it to?

The categorical variables are listed as character variables . The best practice is to convert these categorical variables to factor variables. The advantage of using factors are a few, firstly, they can be correctly understood and used is statistical modeling, secondly, as studied during the lectures, some interesting graphics may use factor variables. We should convert these variables and expect the class factor instead.

Question 2

Make any changes you feel are necessary.

# method 1: using lapply()

# find and return chr variables by:
# 1. use sapply() and is.character to find chr columns and return them as logical (TRUE is chr)
# 2. filter the original dataframe to select only the chr variables
# 3. obtain the chr variable names with colnames()
chr_variables <- colnames(mpg[,sapply(mpg,is.character)])

# convert chr variables to factors
mpg[chr_variables] <- lapply(mpg[chr_variables], as.factor)

# method 2: using mutate_if(), quicker and much more readable
mpg <- mpg %>% 
  mutate_if(is.character, as.factor)

glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <fct> audi, audi, audi, audi, audi, audi, audi, audi, audi, aud…
## $ model        <fct> a4, a4, a4, a4, a4, a4, a4, a4 quattro, a4 quattro, a4 qu…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <fct> auto(l5), manual(m5), manual(m6), auto(av), auto(l5), man…
## $ drv          <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <fct> p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, r, …
## $ class        <fct> compact, compact, compact, compact, compact, compact, com…

Exploratory data analysis

Non-Graphical EDA

Below hwy is summarised by vehicle class, by asking for commonly used summary statistics.

mpg %>% 
  group_by(class) %>%         # by class
  summarise(min = min(hwy),   # minimum of hwy
            mean = mean(hwy), # mean
            sd = sd(hwy),     # standard deviation
            max = max(hwy),   # maximum
            n = n())  %>%     # nr of observations
  kbl() %>%  
  kable_styling(latex_options = c("striped", "hover"), full_width = F)  
class min mean sd max n
2seater 23 24.80000 1.303840 26 5
compact 23 28.29787 3.781620 44 47
midsize 23 27.29268 2.135930 32 41
minivan 17 22.36364 2.062655 24 11
pickup 12 16.87879 2.274280 22 33
subcompact 20 28.14286 5.375012 44 35
suv 12 18.12903 2.977973 27 62

Question 3

Create a summary of engine displacement by year, including the minimum, maximum, median, and inter quantile range.

mpg %>% 
  group_by(year) %>%                    # by year 
  summarise(min = min(displ),           # minimum of engine disp
            max = max(displ),           # maximum 
            Q1 = quantile(displ, 0.25), # 1st quartile
            median = median(displ),     # median, 2nd quartile
            Q3 = quantile(displ, 0.75)  # 3rd quartile
  ) %>%
  kbl() %>%  
  kable_styling(latex_options = c("striped", "hover"), full_width = F)  
year min max Q1 median Q3
1999 1.6 6.5 2.2 3.0 4.0
2008 1.8 7.0 2.5 3.6 4.7

Graphical EDA

# testing base R functions plot(), hist(), barplot()
hist(mpg$displ)

barplot(table(mpg$cyl))

plot(x = mpg$displ, y = mpg$hwy,
     xlab = "Highway mpg",
     ylab = "Engine displacement (L)")

General ggplot template:

```{r}
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>)) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <THEME> 
```
ggplot(iris) + # Data
  geom_point(mapping = aes(x = Sepal.Length, # Variable on the x-axis
                           y = Sepal.Width, # Variable on the y-axis
                           colour = Species)) + # Legend 
  labs(x = "Sepal Length",
       y = "Sepal Width", 
       title = "Relationship between Sepal Length and Width by Species") +
  coord_cartesian() + # Default standard for mapping x and y
  facet_wrap(~Species) + # Splits plot by the species variable
  theme_bw() # Sets the background theme

Visualising distributions of single variables

Categorical variables

Bar chart

ggplot(mpg) +             # data element (mpg)
  geom_bar(aes(x = drv)) # geom_function (bar plot)

Question 4

Create a bar chart that shows a count for the different vehicle classes in mpg.

ggplot(mpg) +               # data element (mpg)
  geom_bar(aes(x = class))  # geom_function (bar plot) with variable 'class' to be shown on axis x.

Question 5

Look up different ggplot themes, and apply one to the bar chart you just created.

ggplot(mpg) +               # data element (mpg)
  geom_bar(aes(x = class)) + # geom_function (bar plot) with variable 'class' to be shown on axis x
  theme_minimal()

ggplot(mpg) +               # data element (mpg)
  geom_bar(aes(x = class)) + # geom_function (bar plot) with variable 'class' to be shown on axis x
  theme_pubclean()

Continuous variables

Histogram

ggplot(mpg) +
  geom_histogram(aes(x = cty), 
                 binwidth = 3)

Question 6

What happens when you change the value of binwidth to 1? Does the distribution change?

ggplot(mpg) +
  geom_histogram(aes(x = cty), 
                 binwidth = 1)

A binwidth of 1 is thinner than the one of 3, meaning it corresponds to a shorter interval of values. The distribution does not change significantly, but it becomes less smooth. It also reveals two interesting patterns: first, a peak at around cty = 21, almost as high as the interquartile ranges of Q1 = 14 and Q3 = 19, which was not visible in the previous histogram. Secondly, the outliers on the tail become more visible, there are two values of cty greater than 30 with very few cases that could be investigated.

Density plot

ggplot(mpg, aes(x = cty)) +
  geom_density(fill = "darkseagreen") 

Question 7

Add rug marks to plot above by adding the argument + geom_rug(size = 1, colour = “darkorange”)

ggplot(mpg, aes(x = cty)) +
  geom_density(fill = "darkseagreen") + 
  geom_rug(linewidth = 1, 
           colour = "darkorange")

Visualizing the distributions of multiple variables

Continuous - Continuous

Scatterplot

# geom_point plots values of displ on the x axis and values of hwy on the y axis
ggplot(mpg) +
  geom_point(aes(x = displ, 
                 y = hwy))

Adding a third variable as colour.

ggplot(mpg) +
  geom_point(aes(x = displ, 
                 y = hwy, 
                 colour = class))

Question 8

Repeat the plot above, but mapping the type of transmission to the colour aesthetic. Add a theme and change the titles of the x- and y-axes.

# title of axis can be accessed by `labs` (labels)
ggplot(mpg) +
  geom_point(aes(x = displ, 
                 y = hwy, 
                 colour = trans)) +
  labs(x = "Engine Displacement in litres",
       y = "Highway miles per gallon") +
  theme_minimal()

Question 9

Create a scatter plot showing the relationship between engine displacement and city miles, including a line of best fit.

# title of axis can be accessed by `labs` (labels)
# aes() must be inside ggplot now, x and y are required by geom_smooth()

ggplot(mpg, aes(x = displ, 
                 y = cty)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Engine Displacement in litres",
       y = "City miles per gallon") +
  theme_minimal()

Question 10

Recreate the previous plot, but adding colour = class to the aesthetic mappings. Use method = “lm” and set se = FALSE in geom_smooth(). What difference does this make?

# title of axis can be accessed by `labs` (labels)
# aes() must be inside ggplot now, x and y are required by geom_smooth()

ggplot(mpg, aes(x = displ, 
                y = cty,
                colour = class)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Engine Displacement in litres",
       y = "City miles per gallon") +
  theme_minimal()

There are three modifications. First, adding colour to the aesthetic mappings creates clusters using the class variable. As a result, geom_smooth() fits separate lines for each of these colour clusters. Second, the method = "lm" argument stands for linear model and finds a linear relationship in the data, fitting a linear line, instead of a line full of curves. Third, the argument se = FALSE displays no confidence interval around the smooth.

For comparison, let’s try plotting without se set to FALSE.

ggplot(mpg, aes(x = displ, 
                y = cty,
                colour = class)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Engine Displacement in litres",
       y = "City miles per gallon") +
  theme_minimal()

The plot is less readable with the display of overlapping confidence intervals.

Continuous - Categorical

Boxplot

Boxplots are often used to explore the distribution of a continuous variable broken down by a categorical variable.

ggplot(mpg) +
  geom_boxplot(aes(x = drv, 
                   y = hwy))

These features of the boxplot are:

  • Lower horizontal line
  • Thicker horizontal middle line
  • Upper horizontal line
  • Vertical whiskers
  • Points beyond the whiskers

Question 11

What do each of these features tell us?

  • Lower horizontal line: represents the 1st quartile, or 25th percentile
  • Thicker horizontal middle line: the median, represents the 2nd quartile, or 50th percentile
  • Upper horizontal line: represents the 3rd quartile, or 75th percentile
  • Vertical whiskers: the interquartile range (IRQ) is the range between Q1 and Q3. The whiskers are generated using the IQR and a measure of flexibility which is generally 1.5. The inferior whisker is defined as (Q1 - 1.5 * IQR), that is the 25th percentile minus one and a half times the interquartile range. The superior whisker as (Q3 + range * IQR), the 75th percentile plus one and a half times the interquartile range
  • Points beyond the whiskers: represent outliers.

Question 12

Add + coord_flip() to the plot above. What does this do?

ggplot(mpg) +
  geom_boxplot(aes(x = drv, 
                   y = hwy)) +
  coord_flip()

coord_flip() flips the x and y axes, making the reading of bloxplot charts horizontal instead of vertical.

Question 13

What can you conclude about the highway miles per gallon of each type of drive train from the boxplot above?

Drive train “r” (rear-wheel) has a median of highway miles per gallon of a bit greater than 20. There are no outliers for this type of drive.

“f” stands for drive train type front-wheel. The group has the highest median of all three types of drive trains, a bit less than 30 highway miles per gallon. It also has many outliers, the dots beyond the whiskers, both beyond the maximum and minimum whiskers.

Drive train type “4” has no outliers. The group has the smallest median, making less than 20 highway miles per gallon.

Question 14

The variable trans contains 10 different transmissions types. These 10 categories can be assumed under 2 broader categories: manual and auto. Use mutate() and fct_collapse() to collapse the 10 categories of trans to just these 2 categories.

# find the unique transmission types
unique(mpg$trans)
##  [1] auto(l5)   manual(m5) manual(m6) auto(av)   auto(s6)   auto(l4)  
##  [7] auto(l3)   auto(l6)   auto(s5)   auto(s4)  
## 10 Levels: auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) ... manual(m6)
mpg <- mpg %>% 
  mutate(trans = fct_collapse(trans,
                             "manual" = c("manual(m5)", 
                                          "manual(m6)"),
                             "auto" = c("auto(av)", 
                                        "auto(l3)", 
                                        "auto(l4)", 
                                        "auto(l5)", 
                                        "auto(l6)", 
                                        "auto(s4)", 
                                        "auto(s5)", 
                                        "auto(s6)"))) 

# check if mutation works
unique(mpg$trans)
## [1] auto   manual
## Levels: auto manual

Question 15

Use the previous example to create a boxplot mapping cty on the y-axis and drv on the x-axis, this time adding the argument colour = trans to aes(). What has changed?

ggplot(mpg) +
geom_boxplot(aes(x = drv, 
                 y = cty,
                 colour = trans))

Adding a colour variables to the aesthetic of the boxplot splits each drive train into two groups: the manual group and the auto group. Since we originally had 3 drive train groups, the plot now has 6 boxplots. Drive train ‘f’ remains with outliers, both in auto and manual trans. Drive train ‘r’ remains without outliers. Drive train ‘4’ originally had no outliers, but splitting in trans groups revealed outliers in the auto group. The median varies between trans groups for all drive train types - manual has a higher median of highway miles per gallon in all cases.

Facets

Facets are a way to split your plot into many subplots according to some categorical variable.

To do this, you use the command facet_wrap() to facet by a single variable. The first argument to facet_wrap() is a formula initiated by ~ and a variable name.

ggplot(mpg) + 
  geom_point(aes(x = displ, 
                 y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Let’s plot the previous example without the facet for a fresher visual comparison.

ggplot(mpg) + 
  geom_point(aes(x = displ, 
                 y = hwy))  

It is possible to facet on a combination of two variables using facet_grid(). Like before, the first argument is a formula, but containing two variable names separated by ~.

ggplot(mpg) + 
  geom_point(aes(x = displ, 
                 y = hwy)) + 
  facet_grid(drv ~ cyl)

Question 16

Notice that there are empty cells in the bottom left facets above. What do you think this means?

It means there are no data points in the specific combination of the two categorical variables used in the facet grid, that is to say, there are no drive trains of r type (r = rear wheel drive) with 4 cylinders, nor with 5 cylinders. There are also empty cells in the facets that represents the combination of drv = 4 (4wd) and 5 cylinders.

Question 17

Create a scatter plot of displ and hwy, faceting by the of manufacturer name.

ggplot(mpg) + 
  geom_point(aes(x = displ, 
                 y = hwy)) + 
  facet_wrap(~ manufacturer, nrow = 2)

Using nrow = 2 consolidates all facets in two rows. This is not a great choice for two reasons. First, there are a total of 15 different manufacturers, an odd number, making the layot of the facets asymmetric. Second, the plots and crushed to fit the display in only two lines, making them less readable. A nice choice is to use nrow = 3, a number by which 15 is divisible, generating a better layout and giving more horizontal space for each plot.

ggplot(mpg) + 
  geom_point(aes(x = displ, 
                 y = hwy)) + 
  facet_wrap(~ manufacturer, nrow = 3)

Question 18

Change the names on the axes to be more informative.

ggplot(mpg) + 
  geom_point(aes(x = displ, 
                 y = hwy)) + 
  labs(x = "Engine displacement, in litres", 
       y = "Highway miles per gallon") +
  facet_wrap(~ manufacturer, nrow = 3)

Arranging multiple plots

If you have multiple plots that you want to arrange on the same page, you can use ggarrange() from the ggpubr package.The first step is creating some plots.

Question 19

Create the following three plots: A. A bar plot showing counts of class, assigning class to the fill aesthetic.

B. A box plot of class by hwy per gallon, assigning class to the colour aesthetic.

C. A jittered scatterplot of hwy and cty miles per gallon, assigning class to the colour aesthetic.

Remember to save each plot as an object to be called upon later.

# plot A - the `fill` argument attributes a color to each class
plot_a <- ggplot(mpg) +
          geom_bar(aes(x = class, fill = class)) +
          labs(x = "Type of car",
               y = "Count") +
          theme_light()

# plot B
plot_b <- ggplot(mpg) +
          geom_boxplot(aes(x = class,
                           y = hwy,
                           colour = class)) +
          labs(x = "Type of car",
               y = "Highway miles per gallon") +
          theme_light()

# plot C
# The jitter geom is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.

plot_c <- ggplot(mpg) +
          geom_jitter(aes(x = hwy,
                              y = cty,
                              colour = class)) +
          labs(x = "Highway miles per gallon",
               y = "City miles per gallon") +
          theme_light()

D. Use ggarrange() to arrange the three plots in one space. ggarrange() takes the plots as arguments as well as ncol or nrow to customise the arrangement

ggarrange(plot_a,
          plot_b,
          plot_c)

The arranging of the plot can be improved by flipping the coordinates so that the names of the classes in the first two plots are readable. Let’s rewrite the code for A and B adding this property.

# plot A - the `fill` argument attributes a color to each class
plot_a_flip <- ggplot(mpg) +
          geom_bar(aes(x = class, fill = class)) +
          labs(x = "Type of car",
               y = "Count") +
          coord_flip() +
          theme_light()

# plot B
plot_b_flip <- ggplot(mpg) +
          geom_boxplot(aes(x = class,
                           y = hwy,
                           colour = class)) +
          labs(x = "Type of car",
               y = "Highway miles per gallon") +
          coord_flip() +
          theme_light()
ggarrange(plot_a_flip,
          plot_b_flip,
          plot_c,
          ncol = 2, 
          nrow = 2)

E. Repeat the code from the previous question, adding common.legend = TRUE and legend = “bottom” to ggarrange().

common.legend = TRUE avoids repeating class legend and makes the figure more readable.

ggarrange(plot_a_flip,
          plot_b_flip,
          plot_c,
          ncol = 2, 
          nrow = 2,
          common.legend = TRUE,
          legend = "bottom")